EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)
df_dataframe_name <- WDI(indicator = c(name1 = "Indicator Code 1",
name2 = "Indicator Code 2"), extra = TRUE)
Write and read:
write_csv(df_dataframe_name, "data/dataframe_name.csv")
df_dataframe_name <- read_csv("data/dataframe_name.csv")
Look at the data with head(), str(), and summary(), or simply
type df_dataframe_name. See also the Environment tab of
RStudio.
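As a minimal sketch of these inspection functions, the following uses a small made-up tibble (df_toy and its values are hypothetical, not WDI data) so that it runs without downloading anything.

```r
library(tidyverse)

# Hypothetical toy data standing in for a WDI download
df_toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2019, 2020, 2019, 2020),
  name1   = c(1.2, 1.5, NA, 7.8)
)

head(df_toy, 3)   # first three rows
str(df_toy)       # column types and a preview of the values
summary(df_toy)   # per-column summaries, including the number of NAs
glimpse(df_toy)   # tidyverse counterpart of str()
```

summary() is often the quickest way to spot missing values, because it reports the NA count for each numeric column.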
df_dataframe_name |> filter(var == "value")
df_dataframe_name |> filter(var %in% c("value_1", ... , "value_n"))
df_dataframe_name |> filter(var != "value")
df_dataframe_name |> drop_na(var)
df_dataframe_name |> mutate(var_new = var1 * var2)
df_dataframe_name |> arrange(var)
df_dataframe_name |> arrange(desc(var))
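The verbs above can be chained into one pipeline. The sketch below shows this on a made-up tibble; df_toy and its column values are hypothetical and only mirror the placeholder names used in the templates.

```r
library(tidyverse)

# Hypothetical toy data using the placeholder column names
df_toy <- tibble(
  var  = c("x", "y", "z", "x"),
  var1 = c(1, 2, NA, 4),
  var2 = c(10, 20, 30, 40)
)

result <- df_toy |>
  filter(var %in% c("x", "y")) |>    # keep rows whose var is "x" or "y"
  drop_na(var1) |>                   # drop rows with NA in var1
  mutate(var_new = var1 * var2) |>   # add a derived column
  arrange(desc(var_new))             # largest var_new first

result
```

Each step returns a new data frame, so df_toy itself is unchanged unless you assign the result back to it.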
Visualizing using ggplot() + geom_*()
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
transformed_data |> ggplot(aes(year, name1)) + geom_line()
transformed_data |> ggplot(aes(year, name2)) + geom_line()
transformed_data |> ggplot(aes(name1)) + geom_histogram()
categorical_var: factor(year),
income, region
transformed_data |> ggplot(aes(categorical_var, name1)) + geom_boxplot()
transformed_data |> ggplot(aes(name1, name2)) + geom_point()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() + scale_x_log10()
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
transformed_data |> ggplot(aes(name1, name2)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) + scale_x_log10()
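To see the last template in action without real data, here is a sketch on simulated values. The tibble transformed_toy, the seed, and the numbers used to generate it are all assumptions made for illustration.

```r
library(tidyverse)

set.seed(1)

# Simulated data: name2 grows roughly like a power of name1, with noise
transformed_toy <- tibble(
  name1 = 10^runif(50, min = 2, max = 5),
  name2 = 0.001 * name1^1.1 * 10^rnorm(50, sd = 0.2)
)

p <- transformed_toy |>
  ggplot(aes(name1, name2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  scale_x_log10() +
  scale_y_log10()

p
```

Scale transformations are applied before the statistics are computed, so with scale_x_log10() and scale_y_log10() the straight line from geom_smooth() is fitted to the log10 values, not to the raw ones.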
An Example of EDA
Abstract. We study the relation between CO2 emissions per capita and GDP per capita using two World Development Indicators.
library(tidyverse)
library(WDI)
If you do not have wdicache.rds in your data folder,
run the following two code chunks.
wdicache <- WDIcache()
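One possible way to avoid downloading the cache every time is to keep it in an .rds file and read it back when it exists. The helper below is a sketch; the name read_or_create_rds is hypothetical and it is not part of the WDI package. It uses only base R.

```r
# Hypothetical helper: read an .rds file if it exists; otherwise build the
# object with `create`, save it to `path`, and return it.
read_or_create_rds <- function(path, create) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- create()
  # Make sure the target folder exists before saving
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(obj, path)
  obj
}

# Intended use in this notebook (needs the WDI package and a network connection):
# wdicache <- read_or_create_rds("data/wdicache.rds", WDI::WDIcache)
```

Delete the file when you want a fresh cache; the next call will download and save it again.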
WDIsearch() with short = FALSE also returns
the description of each indicator in the last column.
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator",
short = FALSE, cache = wdicache)
Description: Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.
WDIsearch(string = "NY.GDP.PCAP.PP.KD", field = "indicator",
short = FALSE, cache = wdicache)
Description: GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars.
To get the extra country information with extra = TRUE, we supply
the updated cache with cache = wdicache.
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC",
gdppcap = "NY.GDP.PCAP.PP.KD"),
extra = TRUE, cache = wdicache)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
Rows: 16758 Columns: 14
── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, co2pcap, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are three ways to look at the table: type df_co2gdp,
run head(df_co2gdp), or click df_co2gdp under the Environment tab in the
top right pane.
df_co2gdp
If you do not add cache = wdicache when you download
data with extra = TRUE, the region and income of some countries,
such as Czechia and Viet Nam, would be NA, because the country information
bundled with the package needs to be updated using WDIcache(). As for Czechia
and Viet Nam, see the file [Link]
and its Addendum.
df_co2gdp |> filter(country %in% c("Czechia", "Viet Nam")) |>
distinct(country, region, income)
str() shows the structure of the data; glimpse(df_co2gdp) does about the same.
str(df_co2gdp)
spc_tbl_ [16,758 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2c : chr [1:16758] "AF" "AF" "AF" "AF" ...
$ iso3c : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
$ year : num [1:16758] 2012 2008 2009 2004 2011 ...
$ status : logi [1:16758] NA NA NA NA NA NA ...
$ lastupdated: Date[1:16758], format: "2023-12-18" "2023-12-18" ...
$ co2pcap : num [1:16758] 0.3351 0.1656 0.2395 0.0549 0.409 ...
$ gdppcap : num [1:16758] 2123 1557 1824 1260 1961 ...
$ region : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
$ capital : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
$ longitude : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
$ latitude : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
$ income : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
$ lending : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. status = col_logical(),
.. lastupdated = col_date(format = ""),
.. co2pcap = col_double(),
.. gdppcap = col_double(),
.. region = col_character(),
.. capital = col_character(),
.. longitude = col_double(),
.. latitude = col_double(),
.. income = col_character(),
.. lending = col_character()
.. )
- attr(*, "problems")=<externalptr>
df_co2gdppcap <- df_co2gdp |> select(country, iso2c, year, co2pcap, gdppcap, region, income)
Check the distinct values in the region and income columns.
unique(df_co2gdppcap$region)
[1] "South Asia" "Aggregates"
[3] "Europe & Central Asia" "Middle East & North Africa"
[5] "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America"
[9] NA
The following does the same.
df_co2gdppcap |> distinct(region) |> pull()
[1] "South Asia" "Aggregates"
[3] "Europe & Central Asia" "Middle East & North Africa"
[5] "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America"
[9] NA
dput() is handy if you want to reuse the output in your code.
unique(df_co2gdppcap$income) |> dput()
c("Low income", "Aggregates", "Upper middle income", "Lower middle income",
"High income", NA, "Not classified")
INCOME <- c("Low income", "Lower middle income", "Upper middle income",
"High income")
It is possible to get the information in one code chunk.
df_co2gdppcap |> select(region, income) |> lapply(unique)
$region
[1] "South Asia" "Aggregates"
[3] "Europe & Central Asia" "Middle East & North Africa"
[5] "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America"
[9] NA
$income
[1] "Low income" "Aggregates" "Upper middle income"
[4] "Lower middle income" "High income" NA
[7] "Not classified"
It is convenient to have a list of the aggregates at hand.
wdicache$country |> filter(region == "Aggregates") |>
distinct(country, iso2c)
Compare the following with the list above.
df_co2gdppcap |> filter(region == "Aggregates") |>
distinct(country, iso2c)
df_co2gdppcap |> filter(is.na(region)) |>
distinct(country, iso2c)
wdicache$country |> filter(region != "Aggregates") |>
distinct(country, iso2c, region, income) |> arrange(country)
df_co2gdppcap |> filter(region != "Aggregates") |>
distinct(country, iso2c, region, income) |> arrange(country)
Observations:
Check whether you have enough data in each year.
df_co2gdppcap |> drop_na(co2pcap, gdppcap) |>
ggplot(aes(year)) + geom_bar()
Observation:
The code above is the same as the following.
df_co2gdppcap |> drop_na(co2pcap, gdppcap) |>
group_by(year) |> summarize(n = n()) |>
ggplot(aes(year, n)) + geom_col()
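The group_by() and summarize(n = n()) pair can also be written with dplyr's count(). The sketch below checks this on a small made-up tibble (df_toy is hypothetical).

```r
library(tidyverse)

# Hypothetical toy data with some missing values
df_toy <- tibble(
  year    = c(2019, 2019, 2020, 2020, 2020),
  co2pcap = c(1, NA, 2, 3, 4),
  gdppcap = c(10, 20, NA, 30, 40)
)

# Rows per year with both indicators present, written two ways
by_summarize <- df_toy |>
  drop_na(co2pcap, gdppcap) |>
  group_by(year) |>
  summarize(n = n())

by_count <- df_toy |>
  drop_na(co2pcap, gdppcap) |>
  count(year)

all.equal(as.data.frame(by_summarize), as.data.frame(by_count))
```

Looking at the counts as a table is useful when you want the exact number of countries for a given year, which is hard to read off a bar chart.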
COUNTRY <- "World"
df_co2gdppcap |> filter(country == COUNTRY) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap)) + geom_line() +
labs(title = expression(paste(CO[2], " per capita of the World")),
y = expression(paste(CO[2], " per capita in tons")))
Observations:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_co2gdppcap |> filter(iso2c %in% ISO2C) |> drop_na(co2pcap) |>
ggplot(aes(year, co2pcap, col = iso2c)) + geom_line() +
labs(title = expression(paste(CO[2], " per capita of seven countries with large GDP")),
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = expression(paste(CO[2], " per capita in tons")))
Observations:
COUNTRY <- "World"
df_co2gdppcap |> filter(country == COUNTRY) |> drop_na(gdppcap) |>
ggplot(aes(year, gdppcap)) + geom_line() +
labs(title = "GDP per capita of the World")
Observations:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_co2gdppcap |> filter(iso2c %in% ISO2C) |> drop_na(gdppcap) |>
ggplot(aes(year, gdppcap, col = iso2c)) + geom_line() +
labs(title = "GDP per capita of seven countries with large GDP",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "GDP per capita PPP",
caption = "constant 2017 international usd")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> arrange(desc(co2pcap))
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> arrange(co2pcap)
Observations and Questions:
Top 10 countries by CO2 emissions per capita:
Bottom 10 countries by CO2 emissions per capita:
CO2 emissions per capita in the top 10 are about 1000 times those in the bottom 10.
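One way to quantify such a comparison is to take the top and bottom groups with slice_max() and slice_min() and compare their means. The sketch below uses made-up numbers and groups of two rather than ten; comparing means is an assumption here, since the text does not say how the factor was computed.

```r
library(tidyverse)

# Hypothetical toy data, not WDI values
df_toy <- tibble(
  country = LETTERS[1:6],
  co2pcap = c(0.03, 0.05, 1, 5, 20, 30)
)

n_group <- 2  # the text uses 10; 2 keeps the toy example small

top    <- df_toy |> slice_max(co2pcap, n = n_group)
bottom <- df_toy |> slice_min(co2pcap, n = n_group)

# Ratio of the mean of the top group to the mean of the bottom group
ratio <- mean(top$co2pcap) / mean(bottom$co2pcap)
ratio
```

With the real data, replace df_toy by the 2020 rows of df_co2gdppcap without aggregates and set n_group to 10.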
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |>
drop_na(gdppcap) |> arrange(desc(gdppcap))
Observations:
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |>
drop_na(gdppcap) |> arrange(gdppcap)
Observations:
Change the bins or binwidth.
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |> drop_na(co2pcap) |>
ggplot(aes(co2pcap)) + geom_histogram()
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |> drop_na(co2pcap) |>
ggplot(aes(co2pcap)) + geom_histogram(bins = 10)
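The two chunks above change bins. As a sketch of the other option, binwidth fixes the width of each bar instead of the number of bars. The data below are simulated (df_toy and the seed are assumptions), so the shape only loosely resembles the real distribution.

```r
library(tidyverse)

set.seed(1)

# Simulated right-skewed values standing in for co2pcap
df_toy <- tibble(co2pcap = rexp(200, rate = 1 / 5))

# Fix the number of bars
p_bins <- ggplot(df_toy, aes(co2pcap)) +
  geom_histogram(bins = 10)

# Fix the width of each bar; boundary = 0 makes the bars start at 0, 2, 4, ...
p_width <- ggplot(df_toy, aes(co2pcap)) +
  geom_histogram(binwidth = 2, boundary = 0)

p_bins
p_width
```

binwidth is easier to interpret when the unit matters (for example, bars of 2 tons each), while bins is convenient for a quick first look.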
Observations:
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |> drop_na(co2pcap) |>
ggplot(aes(co2pcap, fill = region)) + geom_histogram(bins = 15, col = "black", linewidth = 0.2)
Observations:
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates") |> drop_na(co2pcap) |>
ggplot(aes(co2pcap, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.2) +
labs(fill = "")
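The plot above wraps income in factor(income, levels = INCOME) so that the legend follows the order from low to high income instead of alphabetical order. The sketch below shows what this conversion does on a short made-up vector; INCOME is repeated here only to keep the example self-contained.

```r
# Same order as the INCOME vector defined earlier in this notebook
INCOME <- c("Low income", "Lower middle income", "Upper middle income",
            "High income")

# Hypothetical values of the income column
income <- c("High income", "Low income", "Not classified", NA,
            "Upper middle income")

f <- factor(income, levels = INCOME)

levels(f)                   # the order given by INCOME
is.na(f)                    # "Not classified" is not in INCOME, so it becomes NA
table(f, useNA = "ifany")   # counts per level, with NA listed separately
```

Values that are not listed in levels become NA, which is why "Not classified" and missing incomes end up in the same NA group in the histogram.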
Observations and Questions:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "Histogram of CO2 per capita in 2020", fill = "")
Observations:
The log10 scale and the raw scale each reveal different features of the distribution.
We need to consider the NA values in income.
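One point to keep in mind about NA in income: filter() keeps only the rows where the condition is TRUE, so a condition such as income != "Not classified" also drops the rows where income is NA. The sketch below shows this on a made-up tibble (df_toy is hypothetical).

```r
library(tidyverse)

# Hypothetical toy data with one NA in income
df_toy <- tibble(
  country = c("A", "B", "C", "D"),
  income  = c("High income", "Not classified", NA, "Low income")
)

# NA != "Not classified" evaluates to NA, so country C is dropped as well
kept <- df_toy |> filter(income != "Not classified")

# Keep the NA rows explicitly if they should stay in the data
kept_with_na <- df_toy |>
  filter(income != "Not classified" | is.na(income))

kept$country
kept_with_na$country
```

The same applies to filter(region != "Aggregates"): rows with NA in region are removed too.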
df_co2gdppcap |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "Histogram of CO2 per capita in 1990, 2000, 2010, 2020", fill = "")
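The chunk above filters co2pcap > 0 before applying scale_x_log10(). A likely reason (an assumption, since the text does not state it) is that the logarithm of zero is not finite, so such values cannot be placed on a log scale. A small base R check with made-up values:

```r
# Hypothetical values, including a zero
x <- c(0, 0.01, 1, 100)

log10(x)          # the first element is -Inf
log10(x[x > 0])   # finite once the zero is removed
```

Filtering explicitly makes the choice visible in the code instead of relying on the plotting function to drop the values.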
Observations:
df_co2gdppcap |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(co2pcap, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "CO2 per capita by income level", y = "", fill = "") +
theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(co2pcap) |> filter(co2pcap > 0) |>
ggplot(aes(co2pcap, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "CO2 per capita by region", y = "", fill = "") +
theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates", !is.na(income)) |>
drop_na(gdppcap) |>
ggplot(aes(gdppcap, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
theme(legend.position = "none") + labs(y = "")
Observations:
There are overlaps. We need to study how the income level is determined by the World Bank.
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates", !is.na(region)) |>
drop_na(gdppcap) |>
ggplot(aes(gdppcap, fill = region)) + geom_histogram(bins = 15, col = "black", linewidth = 0.2) +
labs(fill = "")
Observations:
df_co2gdppcap |> filter(year == 2020) |>
filter(region != "Aggregates", !is.na(region)) |>
drop_na(gdppcap) |>
ggplot(aes(gdppcap, fill = region)) + geom_histogram(bins = 15, col = "black", linewidth = 0.2) + scale_x_log10() +
labs(fill = "")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "Histogram of GDP per capita in 2020", fill = "")
Observations:
df_co2gdppcap |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "Histogram of GDP per capita in 1990, 2000, 2010, 2020", fill = "") +
theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |> filter(income != "Not classified") |>
ggplot(aes(gdppcap, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "GDP per capita by income level", y = "", fill = "") +
theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(gdppcap) |> filter(gdppcap > 0) |>
ggplot(aes(gdppcap, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "GDP per capita by region", y = "", fill = "") +
theme(legend.position = "none")
Observations:
df_co2gdppcap |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point(aes(col = region)) +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10() +
labs(title = "GDP per capita vs CO2 per capita",
x = "GDP per capita",
y = expression(paste(CO[2], " per capita in tons")))
Observations:
You will learn how to interpret the values below later in this course.
df_co2gdppcap |> filter(year == 2020) |> drop_na(gdppcap, co2pcap) |>
lm(log10(co2pcap)~log10(gdppcap), data = _) |> summary()
Call:
lm(formula = log10(co2pcap) ~ log10(gdppcap), data = drop_na(filter(df_co2gdppcap,
year == 2020), gdppcap, co2pcap))
Residuals:
Min 1Q Median 3Q Max
-0.68315 -0.15741 -0.00445 0.16201 0.59504
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.31720 0.13473 -32.04 <2e-16 ***
log10(gdppcap) 1.13857 0.03309 34.41 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2375 on 228 degrees of freedom
Multiple R-squared: 0.8385, Adjusted R-squared: 0.8378
F-statistic: 1184 on 1 and 228 DF, p-value: < 2.2e-16
Observations:
On the log10 scale, the regression line fits well, with a slope of about 1.1.
The multiple R-squared is 0.84, so the linear regression explains a good part of the relation in these data.
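As a rough illustration of what the slope means, the fitted model log10(co2pcap) = intercept + slope * log10(gdppcap) can be turned back into a prediction on the raw scale. The sketch below copies the two coefficients from the summary() output above; the function name predict_co2 is hypothetical.

```r
# Coefficients copied from the summary() output above
intercept <- -4.31720
slope     <- 1.13857

# Predicted CO2 per capita (tons) for a given GDP per capita,
# obtained by undoing the log10 transformation
predict_co2 <- function(gdppcap) {
  10^(intercept + slope * log10(gdppcap))
}

predict_co2(10000)
predict_co2(100000)

# Multiplying GDP per capita by 10 multiplies the prediction by 10^slope
predict_co2(100000) / predict_co2(10000)
10^slope
```

A slope slightly above 1 on the log-log scale means that, across countries in 2020, predicted CO2 emissions per capita rise a little faster than proportionally with GDP per capita. This describes an association in the data, not a causal effect.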
We study …..
library(tidyverse)
library(WDI)
Create a data folder if you do not have one under Files.
dir.create("data")
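dir.create() gives a warning when the folder already exists. A small guard, sketched below with base R only, lets you rerun the notebook without that warning.

```r
data_dir <- "data"

# Create the folder only when it is missing
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}
```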
If you do not have wdicache.rds in your data folder,
run the following two code chunks.
wdicache <- WDIcache()
chosen_indicator_1 <- "SE.SEC.ENRR" #example
short_name_1 <- "sec"
chosen_indicator_2 <- "NY.GDP.PCAP.PP.KD" #example
short_name_2 <- "gdppcap"
df_yourdata <- WDI(indicator = c(short_name_1 = chosen_indicator_1,
short_name_2 = chosen_indicator_2),
extra = TRUE, cache = wdicache)
write_csv(df_yourdata, "data/yourdata.csv")
df_yourdata <- read_csv("data/yourdata.csv")
Rows: 16758 Columns: 14
── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, short_name_1, short_name_2, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
COUNTRY <- "World"
df_yourdata |> filter(country == COUNTRY) |> drop_na(short_name_1) |>
ggplot(aes(year, short_name_1)) + geom_line() +
labs(title = "",
y = "")
Observations and Questions:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_yourdata |> filter(iso2c %in% ISO2C) |> drop_na(short_name_1) |>
ggplot(aes(year, short_name_1, col = iso2c)) + geom_line() +
labs(title = "",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "")
Observations and Questions:
COUNTRY <- "World"
df_yourdata |> filter(country == COUNTRY) |> drop_na(short_name_2) |>
ggplot(aes(year, short_name_2)) + geom_line() +
labs(title = "")
Observations and Questions:
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_yourdata |> filter(iso2c %in% ISO2C) |> drop_na(short_name_2) |>
ggplot(aes(year, short_name_2, col = iso2c)) + geom_line() +
labs(title = "",
subtitle = "China, Germany, France, United Kingdom, India, Japan, United States",
y = "",
caption = "")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> arrange(desc(short_name_1))
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> arrange(short_name_1)
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> arrange(desc(short_name_2))
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> arrange(short_name_2)
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "", fill = "")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "", fill = "")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_1, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_1) |> filter(short_name_1 > 0) |>
ggplot(aes(short_name_1, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, fill = factor(income, levels = INCOME))) + geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() +
labs(title = "", fill = "")
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, fill = factor(year))) +
geom_histogram(bins = 15, col = "black", linewidth = 0.1) +
scale_x_log10() + facet_wrap(~year) +
labs(title = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year %in% c(1990, 2000, 2010, 2020)) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, factor(year), fill = factor(year))) +
geom_boxplot() + scale_x_log10() + labs(y = "") + theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |> filter(income != "Not classified") |>
ggplot(aes(short_name_2, factor(income, levels = INCOME), fill = income)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> filter(region != "Aggregates") |>
drop_na(short_name_2) |> filter(short_name_2 > 0) |>
ggplot(aes(short_name_2, region, fill = region)) +
geom_boxplot() + scale_x_log10() +
labs(title = "", y = "", fill = "") +
theme(legend.position = "none")
df_yourdata |> filter(year == 2020) |>
drop_na(short_name_2, short_name_1) |>
ggplot(aes(short_name_2, short_name_1)) + geom_point(aes(col = region)) +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10() +
labs(title = "",
x = "",
y = "")
Observations and Questions:
df_yourdata |> filter(year == 2020) |> drop_na(short_name_2, short_name_1) |>
lm(log10(short_name_1)~log10(short_name_2), data = _) |> summary()
Call:
lm(formula = log10(short_name_1) ~ log10(short_name_2), data = drop_na(filter(df_yourdata,
year == 2020), short_name_2, short_name_1))
Residuals:
Min 1Q Median 3Q Max
-0.279981 -0.058887 0.003311 0.063366 0.246949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.81316 0.06423 12.66 <2e-16 ***
log10(short_name_2) 0.26533 0.01532 17.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08889 on 172 degrees of freedom
Multiple R-squared: 0.6356, Adjusted R-squared: 0.6335
F-statistic: 300.1 on 1 and 172 DF, p-value: < 2.2e-16
Observations and Questions: